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Abstract 

We consider the problem of structure learning for bow-free acyclic 
path diagrams (BAPs). BAPs can be viewed as a generalization of 
linear Gaussian DAG models that allow for certain hidden variables. 
We present a first method for this problem using a greedy score-based 
search algorithm. We also prove some necessary and some sufficient 
conditions for distributional equivalence of BAPs which are used in an 
algorithmic approach to compute (nearly) equivalent model structures, 
allowing to infer lower bounds of causal effects. The application of our 
method to datasets reveals that BAP models can represent the data 
much better than DAG models in these cases. 


1 Introduction 

We consider learning the causal structure among a set of variables from ob¬ 
servational data. In general, the data can be modelled with a structural 
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equation model (SEM) over the observed and unobserved variables, which 
expresses each variable as a function of its direct causes and a noise term, 
where the noise terms are assumed to be mutually independent. The struc¬ 
ture of the SEM can be visualized as a directed graph, with vertices rep¬ 
resenting variables and edges representing direct causal relationships. We 
assume the structure to be recursive (acyclic), which results in a directed 
acyclic graph (DAG). DAGs can be understood as models of conditional 
independence, and many structure learning algorithms use this to find all 
DAGs which are compatible with the observed conditional independencies 


(Spirtes et al., 1993). Often, however, not all relevant variables are observed. 


The resulting marginal distribution over the observed variables might still 
satisfy some conditional independencies, but in general these will not have 
a DAG representation (Richardson and Spirtes, 2002). Also, there generally 
are additional constraints resulting from the marginalization of some of the 


variables (Evans, 2016; Shpitser et ah, 2014). 


In this paper we consider a model class which can accommodate cer¬ 
tain hidden variables. Specifically, we assume that the graph over the ob¬ 
served variables is a bow-free acyclic path diagram (BAP). This means it 
can have directed as well as bidirected edges (with the directed part being 
acyclic), where the directed edges represent direct causal effects, and the 
bidirected edges represent hidden confounders. The bow-freeness condition 
means there cannot be both a directed and a bidirected edge between the 
same pair of variables. The BAP can be obtained from the underlying DAG 
over all (hidden and observed) variables via a latent projection operation 
(Pearl, 2000) (if the bow-freeness condition admits this). We furthermore 
assume a parametrization with linear structural equations and Gaussian 
noise, where two noise terms are correlated only if there is a bidirected edge 
between the two respective nodes. In certain situations, it is beneficial to 
consider this restricted class of hidden variable models. Such a restricted 
model class, if not heavily misspecified, results in a smaller distributional 
equivalence class, and estimation is expected to be more accurate than for 
more general hidden variable methods like FCI (Spirtes et al. 1993), RFCI 


(Colombo et al., 2012), or FCI+ (Claassen et al 



The goal of this paper is structure learning with BAPs, that is, finding 
the set of BAPs that best explains some observational data. Just like in 
other models, there is typically an equivalence class of BAPs that are sta¬ 
tistically indistinguishable, so a meaningful structure search result should 
represent this equivalence class. We propose a penalized likelihood score 
that is greedily optimized and a heuristic algorithm (supported by some 
theoretical results) for finding equivalent models once an optimum is found. 
This method is the first of its kind for BAP models. 
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Figure 1: (a) DAG with hidden variables H±, H 2 , H 3 , (b) resulting BAP 
over the observed variables A]...., A 4 with annotated edge weights, and 
(c) resulting graph if A 3 is also not observed, which is not a BAP. 


Example of a BAP 


Consider the DAG in Figure la where we observe variables Ai,...,A 4 , 
but do not observe H\, H 2 , H 3 . The only (conditional) independency over 
the observed variables is X\ _LL A 3 | X 2 , which is also represented in the 
corresponding BAP in Figure lb The parametrization of this BAP would 
be 


Ai =€1 

X2 = B21 X± + 62 

A 3 = -B 32 A 2 + 63 
A 4 = -B 43 A 3 + 64 


with (ei, 62 , 63 , 64 ) t ~ jV( 0 , D) and 

/Du 0 0 0 \ 

Q _ 0 P22 0 P24 

0 0 D 33 0 

\ 0 D24 0 D44/ 


Hence the model parameters in this case are B 21 , B 32 , B 43 , Du, D 22 , D 33 , 
D 44 , and D 24 . An example of an acyclic path diagram that is not bow-free 


is shown in Figure lc 


Challenges 

The main challenge, like with all structure search problems in graphical 
modelling, is the vastness of the model space. The number of BAPs grows 
super-exponentially. Hence, exhaustively scoring all BAPs and finding the 
global score optimum is computationally infeasible. 

Another major challenge, specifically for our setting, is the fact that a 
graphical characterization of the (distributional) equivalence classes for BAP 
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models is not yet known. In the (unconstrained) DAG case, for example, 
it is known that models are equivalent if and only if they share the same 
skeleton and v-structures (Verma and Pearl, 1991). A similar result is not 


known for BAPs (or the more general acyclic directed mixed graphs). This 
makes it hard to traverse the search space efficiently, since one cannot search 
over the equivalence classes (like the greedy equivalence search for DAGs, see 
Ghickering (2002)). It also makes it difficult to evaluate simulation results, 


since the graphs corresponding to the ground truth and the optimal solution 
may be distinct and yet still represent the same model. 


Contributions 


We provide the first structure learning algorithm for BAPs. It is a score- 
based algorithm and uses greedy hill climbing to optimize a penalized like¬ 
lihood score. We are able to achieve a significant computational speedup 
by decomposing the score over the bidirected connected components of the 
graph and caching the score of each component. To mitigate the problem 
of local optima, we perform many random restarts of the greedy search. 

We propose to approximate the distributional equivalence class of a BAP 
by using a greedy strategy for likelihood scoring. If two BAPs are similar 
with respect to their penalized likelihoods within a tolerance, they should 
be treated as statistically indistinguishable and hence as belonging to the 
same class of (nearly) equivalent BAPs. Based on such greedily computed 
(near) equivalence classes, we can then infer bounds of total causal effects, 


in the spirit of Maathuis et al. (2009, 2010). 


We present some theoretical results towards equivalence properties in 
BAP models, some of which generalize to acyclic path diagrams. In par¬ 
ticular, we prove some necessary and some sufficient conditions for BAP 
equivalence. Furthermore, we present a Markov Chain Monte Carlo method 


for uniformly sampling BAPs based on ideas from Kuipers and Moffa (2015). 


We obtain promising results on simulated data sets despite the challenges 
listed above. Comparing the maximal penalized likelihood scores between 
BAPs and DAGs on real datasets shows a clear advantage of BAP models. 


Related Work 

There are two main research communities that intersect at this topic. On 


the one side there are the path diagram models, going back to 

Wright 

1934) 

and then being mainly developed in the behavioral sciences (, 

loreskog 

1970 

Duncan 

1975 

Glymour and Scheines 

1986; 

Joreskog 

200 ip. 

In this setting 


a model for the edge functions is assumed, usually a parametric model with 
linear edge functions and Gaussian noise. In the most general formulation, 
the graph over the observed variables is assumed to be an acyclic directed 
mixed graph (ADMG), which can have bows. While in general the pararn- 
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eters for these models are not identified, Drton et al. (2011) give necessary 


and sufficient conditions for global identifiability. Complete necessary and 
sufficient conditions for the more useful almost everwhere identifiability re¬ 
main unknown (however, see Foygel et al. (2012) for some necessary and 


some sufficient conditions). BAP models are a useful subclass, since they 
are almost everywhere identified (Brito and Pearl, 2002). Drton et al.| (2009) 


provided an algorithm, called residual iterative conditional fitting (RICF), 
for maximum likelihood estimation of the parameters for a given BAP. 

On the other side there are the non-parametric hidden variable mod- 


els, which are defined as marginalized DAG models 
marginalized distributions are constrained by condition 

(Pearl, 

2000 1 

The 

al independencies, as 

well as additional equality and inequality constraints ( 

Evans 

2016 

). When 


just modelling the conditional independence constraints, the class of max¬ 


imal ancestral graphs (MAGs) is sufficient (Richardson and Spirtes, 2002). 


Shpitser et al. (2014) have proposed the nested Markov model using AD- 
MGs to also include the additional equality constraints. Finally, mDAGs 


were introduced to model all resulting constraints (Evans 2016). In general 


BAPs induce independence constraints and also Vernra constraints, as well 
as other restrictions that do not apply in the non-parametric case. The BAP 
in Figure [Tb| for example, implies a Vernra constraint. Gaussian BAPs are 
also ‘maximal’, in the sense that every missing edge induces a constraint. In 
the non-parametric case, with each additional layer of constraints learning 
the graphical structure from data becomes more complicated, but at the 
same time more available information is utilized and a possibly more de¬ 
tailed structure can be learned. In the Gaussian case, however, all models 
are parameteric, and fitting BAPs that do not correspond to conditional 
independence models is essentially no different to fitting those that do. At 
the graphical level the search is perhaps easier, since we do not have to place 
the restriction of ancestrality on the structure of the graph. However, unlike 
for MAGs, the equivalence class of BAPs is not known, which means that 
one may end up fitting the same model multiple times in the form of dif¬ 
ferent graphs. Furthermore, BAPs are easier to interpret as hidden variable 
models. This can be seen when comparing the BAP in Figure lb with the 


corresponding MAG. The latter would have an additional edge between X\ 
and X 4 since there is no (conditional) independency of these two variables. 
As can be verified, the BAP and the MAG in this example are not distribu- 
tionally equivalent, since the former encodes additional non-independence 
constraints. 


Structure search for MAGs can be done with the FCI (Spirtes et al. 


1993), RFCI (Maathuis et al. 2009), or FCI+ (Claassen et al. 2013) al- 


1 Strictly speaking, not all SEMs with correlated Gaussian errors can be interpreted 
as latent variable models, since the latent variable models have additional inequality con¬ 


straints. We do not discuss this further here, but see Fox et al. (2015) for more details. 
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gorithms. Silva and Ghahramani (2006) propose a fully Bayesian method 


for structure search in linear Gaussian ADMGs, sampling from the poste¬ 
rior distribution using an MCMC approach. Shpitser et al. (2012) employ a 


greedy approach to optimize a penalized likelihood over ADMGs for discrete 
parametrizations. 


Outline of this Paper 

In Section [2] we give an in-depth overview of the model and its estimation 
from data, as well as some distributional equivalence properties. In Section [3] 
we present the details of our greedy algorithm with various computational 
speedups. In Section [4] we present empirical results on simulated and real 
datasets. All proofs as well as further theoretical results and justifications 
can be found in the Appendix. 


2 Model and Estimation 

2.1 Graph Terminology 

Let Xi,, Xd be a set of random variables and V = (1,..., d} be their 
index set. The elements of V are also called nodes or vertices. A mixed 
graph or path diagram G on V is an ordered tuple G = (V, Eb, Eb) for 
some Ed,Eb C V x V \ {(i,i) \ i £ V}. If (i,j) £ Ed, we say there is 
a directed edge from i to j and write i —> j £ G. If (i,j) £ Eb, we must 
also have (j, i) £ Eb, and we say there is a bidirected edge between i and 
j and write i j £ G. The set pa G (i) := {j \ j —> i £ G} is called the 
parents of i. This definition extends to sets of nodes S in the obvious way: 
pa G (S) := \J ieS pa G (i). The in-degree of i is the number of arrowheads at i. 
If V' C V, E' d C Eb\v'xV', and E' B C Eb\v'xV', then G’ = (V, E' D , E' B ) is 
called a subgraph of G, and we write G' C G. The induced subgraph Gw for 
some vertex set W C V is the restriction of V to vertices W. If G' C G but 
G' 7 ^ G, we call G' a strict subgraph of G and write G' C G. The skeleton 
of G is the undirected graph over the same node set V and with edges i — j 
if and only if i —> j £ G or * o j £ G (or both). 

A path it between i and j is an ordered tuple of (not necessarily distinct) 
nodes tt = (vq = i,... ,vi = j) such that there is an edge between Vk and 
Vk+i for all k = 0,..., l — 1. If the nodes are distinct, the path is called 
non-overlapping (note that in the literature a path is mostly defined as non¬ 
overlapping). The length of n is the number of edges X(tt) = l. If 7r consists 
only of directed edges pointing in the direction of j, it is called a directed 
path from i to j. A node j on a non-overlapping path 7r is called a collider 
if 7T contains a non-overlapping subpath ( i,j,k ) with two arrowheads into 
^ Otherwise j is called a non-collider on the path. If j is a collider on a 

2 That is, one of the following structures: —«-»•<-». 
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non-overlapping path ( i,j,k ), we call ( i,j,k ) a collider triple. Moreover, if 
( i,j,k ) is a collider triple and i and fc are not adjacent in the graph, then 
( i,j,k ) is called a v-structure. A path without colliders is called a freA;. 

Let A,B,C C h be three disjoint sets of nodes. The set an(C) := 

C U {i E V | there exists a directed path from i to c for some c E C} is 
called the ancestors of C. A non-overlapping path it from a E A to b E B is 
said to be m-connecting given C if every non-collider on it is not in C and 
every collider on n is in an(C). If there are no such paths, A and B are 
m-separated given C, and we write A _LL m B \ C. We use a similar notation 
for denoting conditional independence of the corresponding set of variables 
Xa and Xb given Xq~. Xa JL Xb \ Xc- 

A graph G is called cyclic if there are at least two distinct nodes i and j 
such that there are directed paths both from i to j and from j to i. Otherwise 
G is called acyclic or recursive. An acyclic path diagram is also called an 
acyclic directed mixed graph (ADMG). An acyclic path diagram having at 
most one edge between each pair of nodes is called a bow-fre^\ acyclic path 
diagram (BAP). An ADMG without any bidirected edges is called a directed 
acyclic graph (DAG). 

2.2 The Model 

A linear structural equation model (SEM) M is a set of linear equations in¬ 
volving the variables X = (AT,..., Xd) T and some error terms e = (ei,..., ed) T . 

X = BX + e, (1) 

where B is a real matrix, cov(e) = 0 is a positive semi-definite matrix, and 
we assume that all variables X have been normalized to mean zero. M has 
an associated graph G that reflects the structure of B and 0. For every 
non-zero entry Bjj there is a directed edge from j to i, and for every non¬ 
zero entry Q t j there is a bidirected edge between i and j. Thus we can also 
write 0 as: 

Xi = ^ BijXj + €i, for all i € V, (2) 

j€pa c (i) 

with cov(ej, €j) = il t j for all i. j 6 V. 

Our model is a special type of SEM, which we refer to as Gaussian BAP 
modeQ In particular, we make the following assumptions: 

(Al) The errors e follow a multivariate Normal distribution A/"(0, Q). 

(A2) The associated graph G is a BAP. 

3 The structure i —> j together with i «->■ j is also known as bow. 

4 A11 BAP models in this paper are assumed to have a Gaussian parametrization unless 
otherwise stated. 
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Often M is specified via its graph G, and we are interested to find parameters 
6q compatible with G. We thus define the parameter spaces for the edge 
weight matrices B (directed edges) and 12 (bidirected edges) for a given BAP 
G as 


Bq = {B G M. dxd | Bij = 0 if j -G i is not an edge in G} 

Og = {hi G M. dxd | Qjj = 0 if i j and i «->■ j is not an edge in G; 

17 is symmetric positive semi-definite} 

and the combined parameter space as 


e G = b g x o G . 


The covariance matrix for X is given by: 

m = (I-B)~ 1 n(I-B)- T , (3) 


where <f> : @g —> <Sg maps parameters to covariance matrices, and Sg ■= 
({)(Qg) is the set of covariance matrices compatible with G. Note that ) 
exists since G is acyclic by |(A2)| and therefore I — B is invertible. 

We assume that the variables are normalized to have variance 1, that 
is, we are interested in the subset Sg C Sg, where Sq = {£ £ Sg \ £a = 
1 for all* = 1,..., d}, and its preimage under cj>, ®g '■= ft" 1 (<5g) C @g- 
One of the main motivations of working with BAP models is parameter 
identihability. This is defined below: 


Definition 1. A normalized parameter Og & ®g is identifiable if there is 
no 0' G G ®g such that Oq 0' G and 4>(0 g) = fi{d' G ). 


Brito and Pearl (2002) show that for any BAP G, the set of normalized 


non-identifiable parameters has measure zero. 

The causal interpretation of BAPs is the following. A directed edge 
from X to Y represents a direct causal effect of X on Y. A bidirected edge 
between X and Y represents a hidden variable which is a cause of both X 
and Y, see also Figure[l} In practice, one is often interested in predicting the 
effect of an intervention at X, on another variable Xj. This is called the total 
causal effect of Xj on Xj and can be defined as E %3 = ^E[X 3 | do(Xi = x)}, 
where the do(Xi = x) means replacing the respective equation in ([2]) with 
Xi = x (Pearl , 12000). For linear Gaussian path diagrams such as in 0 or 
Q, this is a constant quantity given by 




( 4 ) 
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2.3 Penalized Maximum Likelihood 


Consider a BAP G. A first objective is to estimate the parameters 6 q from n 
i.i.d. samples of model ([ 2 ]), denoted by {x^ } (i = 1 ,..., d and s = 1 ,..., n). 
This can be done by maximum likelihood estimation using the RICF method 
of Drton et al. (2009). Given the Gaussianity assumption |(A1)| and the 
covariance formula (13]) , one can express the log-likelihood for some given 
parameters Oq and the sample covariance matrix S as: 


l(e G ; S) = ~ (log \ 2ttY,q\ + ^(E^S) 


(5) 


where Eg = 4>(9g) is the covariance matrix implied by parameters do, see 
for example Mar dia et al.| (1979, (4.1.9)). However, due to the structural 
constraints on B and G it is not straightforward to maximize this for 9q- 
RICF is an iterative method to do so, yielding the maximum likelihood 
estimate: 


9q = arg max ((S'; @g). (6 ) 

We now extend this to the scenario where the graph G is also unknown, 
using a regularized likelihood score with a BIC-like penalty term that in¬ 
creases with the number of edges. Concretely, we use the following score for 
a given BAP G: 

s(G ) := — ( max 1(S-,9g) — (#{nodesl + #{edges}) logn ] . (7) 

n \ 0 G ee G ) 


We have scaled the log-likelihood and penalty with 1/n so that the score is 
expected to be 0(1) as n increases. Compared with the usual BIC penalty, 
we chose our penalty to be twice as large, since this led to better performance 
in simulations studies. The number of nodes is typically fixed, so does not 
matter for comparing graphs over the same vertex set. We included it to 
make explicit the penalization of the model parameters (which correspond 
to nodes and edges). 

In our search for the true causal graph G, we assume that if E £ Sh for 
any other graph H, then Sh 2 Sg, that is H represents a strict supermodel 
of G. This rules out the possibility that ‘by chance’ we land on a distribution 
contained in a submodel, and is a minimal requirement for causal learning. 
The set of matrices that violate the requirement has measure zero within the 
model Sq (assuming entries in B and C are generated according to a posi¬ 
tive joint density with respect to the Lebesgue measure). This requirement 
is analogous to the faithfulness assumption of Spirtes et al. (1993), though 
faithfulness applies separately to individual conditional independence con¬ 
straints rather than to the entire model. 
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2.4 Equivalence Properties 


There is an important issue when doing structure learning with graphi¬ 
cal models: typically the maximizers of ([7]) will not be unique. This is 
a fundamental problem for most model classes and a consequence of the 
model being underdetermined. In general, there are sets of graphs that are 
statistically indistinguishable (in the sense that they can all parametrize 
the same joint distributions over the variables). These graphs are called 
distributionally equivalent. For nonparametric DAG models (without non¬ 
linearity or non-Gaussianity constraints), for example, the distributional 
equivalence classes are characterized by conditional independencies and are 
called Markov equivalence classes. For BAPs, distributional equivalence is 
not completely characterized yet (see Spirtes et al. (1998) or Williams (2012) 
for a discussion of the linear Gaussian ADMG case), but we present some 
necessary and some sufficient conditions, that can be used to simplify struc¬ 
ture search in practice. Let us first make precise the different notions of 
model equivalence. 


Definition 2. Two BAPs G\,G 2 over a set of nodes V are Markov equiv¬ 
alent if they imply the same m-separation relationships. 


This essentially means they imply the same conditional independencies, 
and the definition coincides with the classical notion of Markov equivalence 
when G\ and G 2 are both DAGs. The following notion of distributional 
equivalence is stronger. 

Definition 3. Two BAPs G\,G 2 are distributionally equivalent if Sq 1 = 

Sg 2 - 

We now present some sufficient and some necessary conditions for dis¬ 
tributional equivalence in BAP models. 


2.4.1 Necessary Conditions 


Spirtes et al. (1998) showed the following global Markov property for general 


linear path diagrams: if there are nodes a,b G V and a possibly empty set 
C C V such that a _LL m b \ C , then the partial correlation of X a and X & 
given Xq is zero. In addition, if such an m-separation does not hold then 
the partial correlation is non-zero for almost all distributions. As a direct 
consequence, we get the following first result: 


Lemma 1. If two BAPs G\,G 2 do not share the same m-separations, they 
are not distributionally equivalent. 


Unlike for DAGs, the converse is not true, as the counterexample in 
Figure [3] shows. For DAGs it is trivial to show that having the same skeleton 
is necessary for Markov equivalence, since a missing edge between two nodes 
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Figure 2: The two BAPs in (a) and (b) differ only in the direction of the 
X\, X 3 edge; both have no nr-separations, and every induced subgraph 
leads to models which are distributionally equivalent. However, by using 
the results of Shpitser et al. (2014) one can show that these models are not 
distributionally equivalent. 


means they can be d-separated, and thus a conditional independency would 
have to be present in the corresponding distribution. For BAPs a missing 
edge does not necessarily result in an m-separation, as the counterexample 
in Figure [3] shows. However, the following result will allow us to improve 
the necessary condition of same nr-separations for BAPs to the same as for 
DAGs. 

Theorem 1. Let G and G' be distributionally equivalent BAPs on vertices 
V. Then, for any subset W C V, the induced subgraphs Gw and G' w are 
also distributionally equivalent. 

If we in particular look at the induced subgraphs of size two and three 
we obtain the following necessary conditions for distributional equivalence. 

Corollary 1. Let G and G' be distributionally equivalent BAPs. Then they 
have the same skeleton and v-structures. 


Since m-separations are not fully determined by the skeleton and the v- 
structures of a graph, it is also worthwhile to look at larger subgraphs. This 
leads, for example, to the following result: if two graphs are distributionally 
equivalent and a particular path is a so-called discriminating path in both 
graphs, then the discriminated triple will be a collider in both or in neither 


(see Ali et al. 2009 Section 3.4). 


The criteria given above are not complete, in the sense that there exist 
BAPs that are not distributionally equivalent and yet this cannot be proven 
by applying these results. For example, the BAPs in Figure [2] are not dis¬ 
tributionally equivalent, which can be shown using the results of Shpitserj 
et al. (2014). However, they both have no m-separations, and for every pair 


of induced subgraphs the BAPs obtained are distributionally equivalent. A 
complete characterization remains an open problem. 
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Figure 3: The two BAPs in (a) and (b) share the same skeleton and v- 
structures, but in (a) there are no m-separations, whereas in (b) we have 
X 2 -1L m X 3 | { X\,X 4 }. BAPs (a) and (c) share the same m-separations 
(none) but are not distributionally equivalent since they have different skele¬ 
tons (using Corollary [lj) . 


2.4.2 Sufficient Conditions 


To prove sufficient conditions, we first give a characterization of the equiva¬ 
lence class in terms of treks (collider-free paths) using Wright’s path tracing 


formula (Wright, 1960). Wright’s formula expresses the covariance between 


any two variables in a path diagram as the sum-product over the edge labels 
of the treks between those variables, as long as all variables are normalized to 
variance 1. A precise statement as well as a proof of a more general version of 
Wright’s formula can be found in the Appendix (Theorem [4]). As an exam¬ 
ple, consider the BAP in Figure [Tbj There are two treks between X 2 and X 4 : 


X 2 —> X 3 —»• X 4 and X 2 ■H- X 4 . Hence cov(AT 2 , X 4 ) = -B 32 -B 43 + H 24 , assum¬ 
ing normalized parameters. Similarly we have cov(Xi, X 4 ) = -B 21 H 32 -B 43 . 

As a consequence of Wright’s formula, we can show that having the same 
skeleton and collider triples is sufficient for two acyclic path diagrams to be 
distributionally equivalent: 


Theorem 2. Let G\,G 2 be two acyclic path diagrams that have the same 
skeleton and collider triples. Then G\ and G 2 are distributionally equivalent. 

For DAGs, it is known that the weaker condition of having the same 
skeleton and v-structures is sufficient for being Markov equivalent. For BAPs 
this is not true, as the counterexample in Figure [3] (together with Lemma[lJ 
shows. 

We therefore have that the distributional equivalence class of a BAP G: 


• is contained in the set of BAPs with the same skeleton and v-structures 
as G and 


• contains the set of BAPs with the same skeleton and collider triples 
as G. 


We know that the first relation is strict by the counterexample mentioned 
above and have strong evidence that the second relation is strict as well (Now- 
zohour, 2015, Appendix B). 
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3 Greedy Search 


We aim to find the maximizer of 0 over all graphs over the node set 
V = {1,...,g£}. Since exhaustive search is infeasible, we use greedy hill¬ 
climbing. Starting from some graph G°, this method obtains increasingly 
better estimates by exploring the local neighborhood of the current graph. 
At the end of each exploration, the highest-scoring graph is selected as the 
next estimate. This approach is also called greedy search and is often used 
for combinatorial optimization problems. Greedy search converges to a local 
optimum, although typically not the global one. To alleviate this we repeat 
it multiple times with different (random) starting points. 

We use the following neighborhood relation. A BAP G' is in the local 
neighborhood of G if it differs by exactly one edge, that is, the number of 
edges differs by at most one, and one of the following holds: 

1. G C G' (edge addition), 

2 . G'cG (edge deletion), or 

3. G and G' have the same skeleton (edge change). 


If we only admit the first condition, the procedure is called forward search , 
and it is usually started with the empty graph. Instead of searching through 
the complete local neighborhood at each step (which can become prohibitive 
for large graphs), we can also select a random subset of neighbors and only 
consider those. 

In Sections 3T and T2 we describe some adaptations of this general 
scheme, that are specific to the problem of BAP learning. In Section 3.3 we 
describe our greedy equivalence class algorithm. 


3.1 Score Decomposition 


Greedy search becomes much more efficient when the score separates over the 
nodes or parts of the nodes. For DAGs, for example, the log-likelihood can 
be written as a sum of components, each of which only depends on one node 
and its parents. Hence, when considering a neighbor of some given DAG, 
one only needs to update the components affected by the respective edge 
change. A similar property holds for BAPs. Here, however, the components 
are not the nodes themselves, but rather the connected components of the 
bidirected part of the graph (that is, the partition of V into sets of vertices 
that are reachable from each other by only traversing the bidirected edges). 


For example, in Figure lb the bidirected connected components (sometimes 


also called districts) are {Xi}, {X 2 , X 4 }, {X 3 }. This decomposition property 
is known (Tian, 2005; |Richardson 2009), but for completeness we give a 
derivation in the Appendix. We write out the special case of the Gaussian 
parametrization below. 
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Let us write Pq for the joint density of X under the model ([2|, and p G 
for the corresponding joint density of e. Let C\,.... Ck be the connected 
components of the bidirected part of G. We separate the model G into 
submodels G\,..., Gk of the full SEM Q , where each Gk consists only of 
nodes in V/ c = C/ c Upa(C'fc) and without any edges between nodes in pa(Cfc). 
Then, as we show in the Appendix, the log-likelihood of the model with 
joint density p G given data T> = {x^} (with 1 < i < d and 1 < s < n) can 
be written as: 


K'Pg- v ) = 

3=1 


E 


KPG k ;{4 s) Yievr' 


l )- E 1 Mb»{ X 3 

jepa(C fe ) 




where l ( p Gk ; {x^ } s 1,, " ,n ) refers to the likelihood of the Xj-marginal of p Gk 
For our Gaussian parametrization, using © , this becomes 


Z(£ Gl ,...,S Gx ;S) = 

_ ?E f I C'fcl log 2 tt + log ^^ 2 +E~tr(S^5 Gfc -|pa(C , fc )|)^ , 
1 k V liiepa(C fc )%- n ) 


where Sc k is the restriction of S to the rows and columns corresponding to 
Ck, and crjh is the diagonal entry of S Gfc corresponding to parent node j. 

Note that now the log-likelihood depends on {E' ) } and p G only via S and 
E Gl ,..., E Gk . Furthermore, the log-likelihood is now a sum of contribu¬ 
tions from the submodels Gk- This means we only need to re-compute the 
likelihood of the submodels that are affected by an edge change when scor¬ 
ing the local neighborhood. In practice, we also cache the submodel scores, 
that is, we assign each encountered submodel a unique hash and store the 
respective scores, so they can be re-used. 


3.2 Uniformly Random Restarts 

To restart the greedy search we need random starting points (BAPs), and it 
seems desirable to sample them uniformly at randon^] Just like for DAGs, 
it is not straightforward to achieve this. What is often done in practice is 
uniform sampling of triangular (adjacency) matrices and subsequent uniform 
permutation of the nodes. However, this does not result in uniformly dis¬ 
tributed graphs, since for some triangular matrices many permutations yield 
the same graph (the empty graph is an extreme example). The consequence 

5 Another motivation for uniform BAP generation is generating ground truths for sim¬ 
ulations. 
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n=30000; h=-4.1271 


o Naive (LL/n=-4.393) 
A MCMC (LL7n=-4.128) 
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BAP ID 


Figure 4: Relative frequencies of the 62 3-node BAPs when sampled 
30000 times with the “naive” (triangular matrix sampling) and the MCMC 
method. 


is a shift of weight to more symmetric graphs, that are invariant under some 
permutations of their adjacency matrices. A simple example with BAPs for 
d = 3 is shown in Figure [4j One way around this is to design a random 
process with graphs as states and a uniform limiting distribution. A cor¬ 
responding Markov Chain Monte Carlo (MCMC) approach is described for 
example in Melangon et al. (2001) for the case of DAGs. See also Kuipers| 


and Moffa (2015) for an overview of different sampling schemes. 


We adapted the MCMC algorithm for BAPs as described below. 


Algorithm 1. Let Gk = (V, Ed, E b ) be the BAP of the current MCMC 
iteration. Let ( i,j ) G V x V\ {(z, i) \ i G V} be a position sampled uniformly 
at random and let a G Bernoulli( 0.5) be a single Bernoulli draw. We then 
form Gk +i = (V, E' d , E' b ) by applying the following rules. 

1. If there is an edge at ( i,j ) (i.e. if (i,j) G Ed or ( j,i ) G Ed or ( i,j) G 
Eb ), and 


(a) if a = 0: remove the edge (i.e. E' D = Ed \ {(z, j), (j, *)}, E' B = 

(b) if a = 1: do nothing. 


2. If there is no edge at ( i,j) (i.e. if ( i,j ) ^ Ed and ( j,i ) ^ Ed and 
(i,j) E b ), and 

(a) if a = 0: add i —f j (i.e. E' D = Ed U {(z, j)}, E' B = E B ) as long 
as this does not create a directed cycle, otherwise do nothing; 

(b) if a = 1: add i j (i.e. E' D = E D , E' B = E B U {(*, j), (. j,i)}). 
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It is easy to check that the resulting transition matrix is irreducible and 
symmetric (see Appendix |A.3[ ) and hence the Markov chain has a (unique) 
uniform stationary distribution. Thus, starting from any graph, after an 
initial burn-in period, the distribution of the visited states will be approxi¬ 
mately uniform over the set of all BAPs. In practice, we start the process 


from the empty graph and sample after taking 0(d 4 ) steps (c.f. Kuipers and 
Moffa| ( fMTsl ) ). 

It is straightforward to adapt this sampling scheme to a number of con¬ 
straints, for example uniform sampling over all BAPs with a given maximal 
in-degree. 


3.3 Greedy Equivalence Class Construction 

We propose the following recursive algorithm to greedily estimate the dis¬ 
tributional equivalence class EC(G) of a given BAP G with score C t . We 
start by populating the empirical equivalence class EC(G ) with graphs that 
have the same skeleton and colliders as G, since these are guaranteed to be 
equivalent by Theorem [2] This is a significant computational shortcut, since 
these graphs do not have to be found greedily anymore. Then, starting once 
from each of the graphs in EC(G ) found above, at each recursion level we 
search all edge-change neighbors of the current BAP for BAPs that have a 
score within e of £ (edge additions or deletions would result in non-equivalent 
graphs by Corollary [I]). For each such BAP, we spark a new recursive search 
until a maximum depth of d(d— l)/2 (corresponding to the maximum num¬ 
ber of possible edges) is reached, and always comparing against the original 
score C- Already visited states are stored and ignored. Finally, all found 
graphs are added to EC{G). The main tuning parameter here is e, essen¬ 
tially specifying the threshold for statistically indistinguishable graphs. This 
results in conservative estimates for total causal effects using the methods 


discussed in Section |4.1| by also including neighboring equivalence classes, 
that are statistically indistiguishable from the given data. 


3.4 Implementation 

Our implementation is done in the statistical computing language R (R Core] 


Team, 2015). The code is available by the corresponding author on request, 


and it will also be published in a future version of the R package pcalg 
(Kalisch et al., 2012). We make heavy use of the RICF implementation 


f itAncestralGrapljjin the ggm package (Marchetti et al., 2015). We noted 


that there are sometimes convergence issues, so we adapted the implemen¬ 
tation of RICF to include a maximal iteration limit (which we set to 10 by 
default). 


b Despite the function name the implementation is not restricted to ancestral graphs. 
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4 Empirical Results 


In this section we present some empirical results to show the effectiveness of 
our method. First, we consider a simulation setting where we can compare 
against the ground truth. Then we turn to a well known genomic data set, 
where the ground truth is unknown, but the likelihood of the fitted models 
can be compared against other methods. 


4.1 Causal Effects Discovery on Simulated Data 

To validate our method, we randomly generate ground truths, simulate data 
from them, and try to recover them from the generated datasets. This pro¬ 
cedure is repeated N = 100 times and the results are averaged. We now 
discuss each step in more detail. 


Randomly generate a BAP G. 

We do this uniformly at random (for a fixed model size d = 10 and maximal 


in-degree a = 2). The sampling procedure is described in Section 3.2 


Randomly generate parameters 6 q. 

We sample the directed edge labels in B independently from a standard Nor¬ 
mal distribution. We do the same for the bidirected edge labels in P, and 
set the error variances (diagonal entries of Q) to the respective row-sums of 
absolute values plus an independently sampled y 1 2 * (l) valu(0 


Simulate data {x^} from Bq- 

This is straightforward, since we just need to sample from a multivariate 
Normal distribution with mean 0 and covariance 4>(9g)- We use the func¬ 


tion rmvnormO from the package mvtnorm (Genz et ah, 2014) 


Find an estimate G from {x^}. 

We use greedy search with R = 100 uniformly random restarts (as outlined 
in Section [3]), as well as one greedy forward search starting from the empty 
model. 


Compare G and G. 

A direct comparison of the graphs does not make sense since they could 
be different but in the same equivalence class. We therefore estimate the 
equival ence classes of both G and G using the greedy approach described in 


Section 


3.3 


with e = 10 10 to get EC(G) and EC(G). 


1 By Gershgorin’s circle theorem, this is guaranteed to result in a positive definite 

matrix. To increase stability, we also repeat the sampling of fl if its minimal eigenvalue 

is less then 10~ 6 . 
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Since the estimated equivalence classes are empirical, it is not straight¬ 
forward to compare them. For one, they might be intersecting, but not equal 
(if the recursion level was set too low and they were started from different 
graphs for example). More relevantly, they might be entirely different, but 
still agree in large areas of the graph. We therefore chose to evaluate not 
the graph structure but the identifiability of causal effects. Often this is 


also more relevant in practice. Maathuis et al. (2009) developed a method 


(which they called IDA) to find identifiable causal effects in a multiset of 
DAGs. We apply the same idea in our setting. Specifically, this means we 
estimate the causal effects matrix E for each graph G' £ EC(G ) (using the 
estimated parameters 9q> = (B 1 , Q') and Q). We then take absolute values 
and take the entry-wise minima over all E to obtain Eq 171 , the minimal 
absolute causal effects matrix (if an entry Eij is nonzero, there is a nonzero 
causal effect from X{ to Xj for every graph in the equivalence class). We do 
the same for G to get E'X n . 

What is left is to compare the minimal absolute causal effects matrix 
Eq m of the ground truth to the minimal absolute causal effects matrix 
Emin Q f es ti ma te. Thus, our target set consists of all pairs such 

that ( EQ in )ij > 0. We score the pairs according to our estimated E™ n 
values, and we report the area under the ROC curve (AUC, see Hanley and 


McNeil (1983)). The AUC ranges from 0 to 1, with 1 meaning perfect clas¬ 
sification and 0.5 being equivalent to random guessing In our case, each 


point on the ROC curve corresponds to a thresholded version of E^ m , i.e. 

Gr 

the k-th point from the left corresponds to making a prediction only for the 
k — 1 largest (minimal absolute value) causal effects. 


The results can be seen in Figure [5] with an AUC of 0.78. While this sug¬ 
gests that perfect graph discovery is usually not achieved, causal effects can 
be identified to some extent. The computations took 2.5 hours on an AMD 
Opteron 6174 processor using 20 cores. 

4.2 Genomic Data 


We also applied our method to a well-known genomics data set (Sachs et al. 


2005), where the expression of 11 proteins in human T-cells was measured 
under 14 different experimental conditions. There are likely hidden con- 
founders, which makes this setting suitable for hidden variable models. How¬ 
ever, it is questionable whether the bow-freeness, linearity, and Gaussianity 
assumptions hold to a reasonable approximation (in fact the data seem not 
to be multivariate normal). Furthermore, there does not exist a ground 


8 Some care has to be taken because of the fact that the cases ( Eq ln )ij > 0 and 
(Eg“)ji > 0 exclude each other, but we took this into account when computing the false 
positive rate. 
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ROC Curves 



Figure 5: ROC curves for causal effect discovery for N = 100 simulation 
runs of BAPs with d = 10 nodes and a maximal in-degree of a = 2. Sample 
size was n = 1000, greedy search was repeated R = 100 times at uniformly 
random starting points. The average area under the ROC curves (AUC) 
is 0.78. The thick curve is the point-wise average of the individual ROC 
curves. 


truth network (although some individual links between pairs of proteins are 
reported as known in the original paper). So we abstain from presenting a 
“best” network, but instead compare the model fit of BAPs and DAGs from 
a purely statistical point of view. 

To do this, we first log-transform all variables since they are heavily 
skewed. We then run two sets of greedy searches for each of the 14 data 
sets: one with BAPs and one with DAGs. For the BAPs we use 100 and 
for the DAGs we use 1000 random restarts (DAG models can be fit much 
faster then BAP models). The results for datasets 1 to 8 can be seen in 
Figure t * The penalized likelihood scores of the best BAP models are 


always much higher than the corresponding scores for the DAG models, 
despite the larger number of random restarts for the DAGs. Furthermore, 
while there is a marked improvement from the random starting scores for 
BAPs, the improvement for the DAG scores is marginal. For these datasets, 
BAPs seem to be superior to DAGs in modelling the data. The computations 
took 4 hours for the BAP models and 1.5 hours for the DAG models on an 
AMD Opteron 6174 processor using 20 cores. 


9 The results for the other 6 datasets are qualitatively similar. 
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BAP; Dataset 7; Score=-54.94 DAG; Dataset 7; Score=-79.76 BAP; Dataset 8; Score=-39.27 DAG; Dataset 8; Score=-92.22 



Figure 6: Greedy search runs on the first 8 of 14 genomic datasets with BAPs 
and DAGs. The x-axis is time in seconds, the y-axis is the (normalized) 
penalized likelihood score. Each path corresponds to a run of a greedy 
search with a different random restart, with each point on the path being a 
state visited by this greedy search run. 
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5 Conclusions 


We have presented a structure learning method for BAPs, which can be 
viewed as a generalization of Gaussian linear DAG models that allow for 
certain latent variables. Our method is computationally feasible and the 
first of its kind. The results on simulated data are promising, keeping in 
mind that structure learning and inferring causal effects are difficult, even 
for the easier case with DAGs. The main sources of errors (given the model 
assumptions are fulfilled) are sampling variability, finding a local optimum 
only, and not knowing the equivalence classes. Local optima are a general 
weakness of many structure learning methods in Bayesian networks since 
this problem is NP-hard in general (Chickering 1996). In our simulations, 
overestimating the equivalence class leads to too few causal effects, while the 
opposite happens if we underestimate it. On the other hand, our approach of 
greedily approximating the empirical equivalence class is building on the idea 
that some models are statistically indistinguishable, due to limited sample 
size and estimation error. Therefore, our approach has the advantage that it 
can include neighboring equivalence classes, that score almost as well, which 
is desirable from a statistical point of view. Our theoretical results about 
model equivalence go some way towards characterizing the distributional 
equivalence classes in BAP models and allow us to efficiently approximate 
them empirically. 

In many applications, not all relevant variables are observed, so allowing 
for hidden variables in the model should improve both the statistical fit as 
well as the interpretability in these cases. The former could be confirmed by 
applying our method to some real datasets. When comparing our method 
with traditional DAG models, that do not allow for hidden variables, we 
could demonstrate a marked improvement of the likelihood scores. 


A Appendix 

A.l Distributional Equivalence 
A. 1.1 Necessary Conditions 

The following Lemma shows that the point ( B , Q) = (0, 1) is non-singular 
for the map cf : ©g —> Sq for any BAP G. 

Lemma 2. Let G be a BAP with parameters (B,Ll). Then (B,Q) are 
uniquely identifiable at T, = I (or indeed at any diagonal T,). 

Proof. We proceed by induction on d, the number of vertices in G. If d = 1 
then the result is trivial since B = 0. 

Otherwise assume without loss of generality that the last vertex d has 
no children. The result holds for the subgraph of the remaining vertices by 
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the induction hypothesis. We know that X is of the form 


X = (I d - B - B )~ t , 


and we may deduce by the induction hypothesis that the first d — 1 rows of 
B are zero, and the upper (d — 1) x (d — l)-sub matrix of is the identity 
matrix. Hence 


n = 


Id-i u \ 

u T U) dd ) 


where u J = (u ld ,.. .,u d - M ), and 


(id - b r 1 


Id -1 0 
~(3 t 1 


Id -1 0 \ 

f3 T 1 ) 


where (3 1 = (f3 dl ,. 


i,d— 1 J 


Hence 


X = (I - B)~ 1 fl(I - B) 


\—T 


Id-1 

P T 

Id-1 

(3 T 


Id-i 

Id-1 

U T 


LC 

Udd 
fi + U> 

fH 1 UJ + UJdd 


Id-1 

0 


/3 

1 


but note that (3 1 oj = 0 by the bow-free assumption, so we get 

v _ ( Id-1 (3 + u \ 

-~\f3 T + u T WPf+UM)' 

and hence f3 + u> = 0. Now note that for each j, either (3 d j = 0 or Uj d = 0 by 
the bow-free assumption; hence (3 + u> = 0 implies that (3 = u = 0, leaving 



and hence ui dd = 1. This completes the result. □ 

Corollary 1. Let G be a BAP. Within a sufficiently small neighborhood of 
X = I, if &ij = 0 (for i f j) then this implies that = fij = 0. 

Proof. Since 4> is nonsingular and differentiable at #o = (0,1), its partial 
derivatives are defined and given by ^y-(^o) = 1 an d = 1 (this 

can be shown via a Taylor expansion for example). Therefore, in a small 
neighborhood around 4 >( 0 q) we have &ij = 0 only if = uiji = (3ij = 0 . □ 
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Note that Lemma [2] allows a direct proof of the fact that having the same 
skeleton is necessary for BAPs to be distributionally equivalent by looking 
at the tangent spaces of the models at £ = I and showing that they are 
determined by the skeletons of the graphs. 

In the proof of Theorem [l] we make use of the language of polynomial 
varieties (see Cox et al. (2007]) for an overview). A variety is a set defined 
by the zeros of some collection of polynomials (in our case polynomials in 
the entries of £), and all SEM models are varieties. 

Let G be a BAP with vertices V = WOW. Let Bq be the set of matrices 
B 6 Bg such that only entries corresponding to directed edges in G between 
vertices in W have non-zero coefficients. Similarly, let Oq be the set of 
D G Oq such that entries corresponding to edges outside W are zero and 
diagonal entries outside W are 1. 

Define a model Sq as the image of the map cf applied to (Bq', Oq'). 
So in other words, we only manipulate parameters in G that correspond 
to vertices and edges in Gw- Clearly the resulting model is isomorphic to 
Sg w , since this is the same setup as for the BAP Gw , but with the matrices 
extended to include independent vertices in W = V \ W. 


Let Tw be the set of covariance matrices £ on V such that £ 


ww 


= I 


and ^ww = b (be. so that vertices outside W are completely independent). 

We will show that looking at the set of covariance matrices in Sg that 
are also in Tw is essentially the same as the set Sq . Since the first set is a 
property of the full model, and the second set is determined by the subgraph 
Gw , this will be enough to prove Theorem [l] 

Proof of Theorem [7} What we need to prove is that Sg = S’g implies Sq w = 
S G ’ W ■ Consider again the variety Tw defined above. 

Clearly Sg = Sc implies Tw n Sg = Tw n Sc ■ We will show that the 
irreducible component of Tw H Sg which contains £ = I is the same as Sq' ; 
since this last quantity is isomorphic to Sg w we will prove the result. 

First, note that Sq C Tw and S g C Sq, so clearly Sq' C Sg n Tw¬ 
in addition, note that by Corollary [lj in a neighborhood of £ = I every 
element of Tw H Sg is also contained in Sq . It follows that the entire 
irreducible component of Tw^Sg containing £ = I is contained within Sq ', 
and therefore that the irreducible component of Tw H Sg containing £ = / 
is S%. □ 

We can now prove the Theorem giving necessary conditions for BAP 
equivalence. 


Proof of Corollary^ 7j Let us first consider vertex pairs, i.e. W = {i,j}- By 
Theorem [I] we have Gw being distributionally equivalent to G' w . If Gw / 
G' w we would have i _LL m j in one of the graphs but not the other, and 
using Lemma [l] this would lead to a contradiction. Hence Gw = G' w for 
any vertex pair, and hence G and G' must have the same skeleton. 


23 










Let us now consider vertex triplets W = {i,j, k}, such that (without loss 
of generality) there is a v-structure at j in Gw- Then i _LL m k in Gw and 
by the same argument as above we must have i _LL m k also in G' w . This 
is only possible if there is a v-structure at j in G' w . Hence G and G' must 
have the same v-structures. □ 


A.1.2 Sufficient Conditions 

We first make precise the definition of an important class of paths: treks. 
These are paths that do not contain colliders. We adopt the notation of 
Foygel et al. (2012). A trek r from i to j can have one of the following 


forms: 


Vi 


Vq <- 




ir R 


or 


Vi i — ■ * * t— Uo —y * * * —y 

where vjf = i and v^ = j and in the second case vq = Vq = Vq. Accordingly, 
we define the left-hand side of r as Left(r) = vj J <—■■■<— Vq and the right- 
hand side of r as Right(r) = —»•••• —»• vjf. Note that there is nothing 

inherently directional about a trek other that the (arbitrary) definition which 
end node is on the left. That is, every trek from i to j is also a trek from 
j to i just with the left and right sides switched. We denote the lengths of 
the left- and right-hand sides of a trek r by A l(t) and \r(t) respectively. 
If r does not contain a bidirected edge, we define its head to be H T = vo, 
otherwise H T = 0. If the left- and right-hand sides of r do not intersect 
(except possibly at H T ), we call r simpl ^ j We define the following sets 
that will be useful later: 

Vq = {n | 7r is a directed path from i to j in G}, 
rj = {t | t is a trek from i to j in G}, 

Sq = {t | t is a simple trek from i to j in G}. 

We will usually drop the subscript if it is clear from context which is the 
reference graph. 

We now show some intermediate results that are well-known, but we 
prove them here for completeness nevertheless. All of these apply more 
generally to path diagrams (possibly cyclic and with bows). 

Lemma 3. Let B £ W dxd such that every eigenvalue A of B satisfies |A| < 1. 
Then (I — B)^ 1 exists and is equal to ■ 

10 Note that each side might well be self-intersecting, if the corresponding graph is cyclic. 
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Proof. First note that det(AI — B) = 0 only if |A| < 1, hence det (I — B) / 
0, and therefore ( I — B)~ l exists. The eigenvalue condition also implies 
lirn^oo B l = 0, therefore 

OO l 

(/ — B) B s = lim y^{B s - B s+1 ) = lim (/ - B l+1 ) = /, 

J l—yoo ^ J /—>oo 

s=0 s=0 

and the result follows. □ 

Lemma 4. Let G be a path diagram over d nodes and B E Bq. Then 

(b%= ^ n b ‘■■ 

7 r£'D ji s->-te7r 
X(tt)=1 

Proof. By induction on l. For l = 1 the claim follows from the definition of 
Bq. Using the inductive hypothesis we get 

d d 

(B l ) ij = {BB l - 1 ) ij = Y J B ik (B l ~ 1 ) kj = Y J B l k Y. II B 

k= 1 k =1 s^rt£n 

\(n)=l —1 

and the claim follows, since every directed path from j to i of length l can 
be decomposed into a directed path n of length l — 1 from j to some node 
k and the edge k —» i. □ 

Lemma 5. Let G be an acyclic path diagram over d nodes and B E Bq. 
Then ( I - B)~ l = I + B + ... + B^ 1 . 

Proof. Since G is acyclic, there is an ordering of the nodes, such that B is 
strictly lower triangular and hence all its eigenvalues are zero. Furthermore, 
the longest directed path in G has length d — 1. Therefore the result follows 
from Lemma |3] and Lemma ® □ 

The following theorem is a version of Wright’s theorem that applies to 
non-standardized variables. It does not require a proper parametrization 
(in the sense that U needs to be positive definite). This result is probably 
known to experts, but we could not find a proof in the literature. 

Theorem 3. Let G be a (possibly cyclic) path diagram over d nodes, B E 
Bq, and U E R. dxd such that Ll is symmetric (but not necessarily positive 
definite) and ftij = 0 if i «->■ j is not an edge in G. Then the entries of the 
matrix (j) = (/ — B) _1 U(J — B)~ T are given by 

= y n Bts n ^ + y n Bt s -4>H T H T 

T£S i: > s—>tGr s< 4 (Gr rSiS^ 

eer 

^ = y n Bu n Qst + y n Bts ■ ft Hr Hr + ftll ■ 

s— s-H-tGr reT" s^t.£r 

-H-Sr 
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Proof. Let us write 


c e (r;B,n)= B ts 9. st 

S^tdT S^t^T 

as a shorthand for the edge contributiorPI of a trek r given parameter ma¬ 
trices B and Q. We write c(t;B,£ 1) = c e (r; B,Q) ■ Qh t h t for the total 
contribution of r (where we define Qh t h t to be 1 if r contains a bidirected 
edge and therefore H r = 0). 

Using Lemma [ 3 J we can expand as = Y^kLo ^£0 B k Q(B l ) T . We 
now first show the following intermediate result, which interprets the entries 
of these matrices as contributions of certain treks: 


(B k Q(B l ) T ) ij = Y c(r-B,n) + n ii l{i = j}, (8) 

reT ij 
A l(t)=& 

A r(t)=1 


for integers k > 0, l > 0. To see this, we expand the double matrix product 
and use Lemma [4] to get 


(. B k n(B l f) ij 


d d 


EE<«‘ )j'6 

a =1 6=1 


d d 


EE 



) 



\ 

E 

n b “ 

^ ab 

e n 

Bts 

L nev ai 

\A(7 r)=k 

s— 


we v b i 8 ~ne 7T 

/ 


and Q follows since each bracketed expression corresponds to one side of 
the trek from i to j via a and b (and the diagonal entries of do not 
correspond to a trek, so they are separate). Now summing over k and l 
gives the following 


<t>ij= Y c ( T ’ B ,ty + = j}, (9) 

reT ij 


which gives the result for the diagonal entries (fa. 

For the off-diagonal entries <pij , we can get a simpler expression involving 
only simple treks and the diagonal entries <f>u. Note that every trek r can 
be uniquely decomposed into a simple part £(t) and a (possibly empty) 
non-simple part p(r) (we just split at the point, where the right- and the 
left-hand sides of r first intersect). Since 


c(r) 


c e (£(r)) • fiff f(T) jj e(T) if p(r) = 0 

c e (£(r)) • c(p(t)) otherwise, 


11 That is, the contribution depending only on the edge labels and not the diagonal 
elements of fl. 
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(dropping the parameter matrices B and Ll in our notation), we can factor 
out the contributions of the simple parts. Note that if the simple part £(r) 
contains a bidirected edge, then p(r) must be empty and = 1 - 

Hence ®> becomes 

<t>a = Y c ^ c ( r ) + c ( r ) 


tGT 13 


rer iJ 

p(r)^0 


tGT' 11 

p(r)=0 


Y Ce(C(r))+ Y c e(^( r )) • c(p(r)) + Y C ^( T )) • n He (r) if, 


C(t) 


tgT^ 


rGT i:i 

p(j)j=0 


tgT^ 

p(t)=0 


Y ° e (£) + X] c (^)+ ce (£) ■ 

p £ T H S H S 








y: c e(o + + ) > 


£eS’ J 


£eS’ J 


KpGT H t H t 


and the result follows. 


□ 


The following version for standardized parameters is often quoted as 
Wright’s theorem. 

Theorem 4. Let G be a (not necessarily acyclic) path diagram over d nodes, 
B E Bq, and H E M. dxd such that Ll is symmetric (but not necessarily positive 
definite) and flij = 0 if i E> j is not an edge in G. Furthermore assume 
that we have standardized parameters B,Ll such that (4>(B,Q))a = 1 for all 
i. Then the off-diagonal entries offi(B,Ll) are given by 

(<KB,n)) ij = E 11 B » 11 n “- 

tgS 1 ^ s^t-Gr s-Grt-Gr 

Proof. This is a direct consequence of Theorem [3| |3| 

We can now prove Theorem [2j which is a consequence of Wright’s for¬ 
mula. 


Proof of Theorem [1| Let 9g 1 E ©Gi and choose 9g 2 = (- 62 ,^ 2 ) such that 
their edge labels agree, that is, 


( B 2 )ij 


( Bf)ij if i <— j G G 1 , i E- j E G2, 

< if i EE j E G\, i E- j E G2, 

,0 if i <- j £ G 2 , 
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and 


(^2 )ij 


{B\)ij if i <- j G Gi, i •(-> j G G2, 

' if i ++ j € G\, i E* j E G 2 , 

,0 if i ^ j ^ G 2 . 


This is possible since G\ and G 2 have the same skeleton: we just assign the 
edge labels of G± to G 2 , irrespective of the edge type. The diagonal entries 
of fl 2 are still free—we now show that they can be used to enforce 


(4>(B 2 , £l 2 ))u — 1 


( 10 ) 


for all i, which defines a linear system for the diagonal entries of fi 2 . Let 
d = diag( 0 2 ) be the vector consisting of the diagonal elements of Q 2 . and 
write (fTol) as Md = 1 , where M is the coefficient matrix of the linear system. 


To show that (10) always has a solution, we need to show that det(M) 7 ^ 0. 


Without loss of generality, assume that the nodes are topologically ordered 
according to G 2 (this is possible since G 2 is assumed to be acyclic), that 
is, there is no directed path from i to j if i > j. Then we have H r < i 
(or H T = 0 ) for all r E T n , and using the expression for <f>a in Theorem [ 3 ] 
we see that M must be lower triangular with diagonal equal to 1. Thus 


det(M) = 1, and we can enforce (10). 

Since G\ and G * 2 share the same collider triples, the sets of simple treks 
between any two nodes are the same in both graphs: S 7 ^ = Sq c> Vi,j. 
Together with Theorem [4] and the fact that the edge labels agree this shows 
that 


0(0Gi) = 4>{0g 2 )- 


( 11 ) 


What is left to show is that fl 2 is a valid covariance matrix, that is, it is 
positive semi-definite. By ^ and (11) we have that 


n 2 = (i-B 2 )x 1 (i-B 2 ) T , 

where Si = Since Si is positive semidefinite, so is fl 2 . 


P 


A.2 Likelihood Separation 

Since we can write e = e(X) as a function of X = (Xj,... ,X^), we have 
that their densities satisfy 

Pg(X 1 , ...,X d )= pUe i(X),..., e d (X)). (12) 

The joint density for the errors e can be factorized according to the indepen¬ 
dence structure implied by J). Let us adopt the notation {X}/ := {Xj}j g / 
and {e}/ := for some index set I. Then we have {e}c fe -1L {c}Ci 
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Vk ^ l. Furthermore, we implicitly refer to marginalized densities via the 
arguments, i.e. we write p G {{e}c \) for the marginal density of {e}ci- We 
can thus write 


p e G (ei, ...,e d )= p G {{e} c J ■ ■ -PgH^Ck)- 


Hence (12) becomes 
Pg( x i,-- 


jepa(i) 




(13) 


Each factor depends only on the nodes in the respective component Ck and 
the parents of that component pa(Cfc). By the same argument the joint 
density of the submodel Gk is 


PaJ{X}v i )=p‘aJMc J II PaJti'l 

iepa(C fe ) 

=*>&„({*<- e ) n 14® 

j'epa(j) iepalCfc) 


This factorization is symbolic, since the parents {^j}j£pa(c k ) will not be 
independent in general. This does not matter, however, since these terms 
cancel when reconstructing the full density p G (X],... ,X d ) later. The ad¬ 
vantage of this symbolic factorization is that we can still fit the (wrong) 
submodel and then use the easier to compute product of marginal densities 
to reconstruct the full density, rather than doing the same with the actual 
submodel factorization and the joint density of the component parents. 

Note that 

)-,«,({*- £ B.X,} ), 

V jepa(i) fe/ V iG P a(i) k/ 


that is, the conditionals {X} Gk \{X} p ^ Gk ^ are the same in models G and 
Gfc. This is because the structural equations of {X} Gk are the same in these 
models. Note also that p Gk (Xj ) = p Gk (Xj ) for all j G pa(Gfc) and all k, 
since pa(Cfc) are source nodes in model G & (all edges between them were 
removed). 


Thus we can reconstruct the full joint density (13) from joint densities 


of the connected component submodels and marginal densities of the parent 
variables: 


pX(x 1 ,...,x d )=n P x k ({x} Vk )- n px^) 


U'epa(Cfc) 
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Writing V for the observed data {xf} (with 1 < i < d and 1 < s < n), the 
log-likelihood can then be written as 


Kpg', v ) = ^2^gp^(x[ s \...,x^) 


S=1 


E E ( lo § PG k ({xi S) }ieV k ) - E log PG k (x < f ) '' 

jepa(Cfc) 


S=1 k 


EI *<*& G‘ , }S!r”> - E '(A; {*fr>.”)) • 

j£pa,(C k ) 


where 1(pq ; 


refers to the likelihood of the Aj-marginal of PQ k - 


A.3 Symmetry and Irreducibility of Markov Chain 

We show that the transition matrix of the Markov Chain described in Al¬ 
gorithm [l] is symmetric and irreducible. For two BAPs G,G', let P(G,G') 
be the probability of a single step transition from G to G'. 

Theorem 5. We have 

1. Symmetry: P(G, G') = P(G',G). 

2. Irreducibility: 3Gi,..., G n such that 


P(G,G i) 


/ n— 1 


}JP(G h G i+ 1 ) 


^ i —1 


P(G n -i, G') > 0 . 


Proof. Let p be the probability of sampling one position i.e. p = 

l/(d(d — 1)). Let us hrst consider the case where G and G' only differ by 
one edge addition, i.e. WLOG either 

• G = G' U i ->• j. 

• G = G'um j. 


In both cases we get P(G, G') = p/ 2 = P(G', G). In the first case, by mul¬ 
tiplying the probabilities along the branches of [la] and 2a in Algorithm [l] 
respectively, and since G has no cycles. In the second case, by multiply¬ 
ing the probabilities along the branches of la and 2b respectively. Hence 
symmetry holds, and irreducibility is trivially true in this case. 

For the general case, note that the transitions described in Algorithm [l] 
involve either edge additions or deletions, so if G, G' do not differ by only 
one edge addition, we have P(G,G') = P(G',G) = 0. Furthermore, we 
can always find a collection of graphs G i,... ,G n , such that irreducibility 
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holds, e.g. by successively removing edges from G until the graph is empty 
and then successively adding edges until we arrive at G'. Then we have 
P{Gi, Gi- |_i) = p/2 > 0 for all 1 < i < n by the case considered above, and 
the claim follows. □ 
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